GH-45601: [R] R arrow cannot handle labelled data in arrow tables #46431

Draft · wants to merge 2 commits into main

Conversation

@thisisnic (Member) commented May 13, 2025

Rationale for this change

There is a bug where R crashes when working with labelled columns in an Arrow Table.

What changes are included in this PR?

Remove labels from columns (see the sketch below)

Are these changes tested?

Yes

Are there any user-facing changes?

Yes

Draft PR - this works for tables but not datasets yet
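
This is not necessarily how the PR implements it, but as a minimal sketch of the general idea (stripping haven labels so Arrow only sees plain vectors), using haven::zap_labels():

library(haven)
library(arrow)
library(tibble)

d <- tibble(
  a = labelled(1:5, labels = c(low = 1L, high = 5L)),
  b = labelled(11:15)
)

# zap_labels() drops the value labels and the haven_labelled class,
# leaving plain integer vectors that Arrow handles natively.
d_plain <- zap_labels(d)

tbl <- arrow_table(d_plain)
tbl$schema   # both columns are now plain int32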

⚠️ GitHub issue #45601 has been automatically assigned in GitHub to PR creator.

@thisisnic (Member, Author) commented:

@amoeba I tried the approach you suggested here, but because we use as_arrow_table() internally in a lot more functions, we end up breaking roundtripping with Feather etc.

I think if we work only in R, we would want to remove the labels and then restore them later, but I'm still trying to find an uncomplicated way of doing this.

I think we definitely want to stop the segfault regardless and error instead.

Users can technically use mutate() to cast the column to a type we can work with, but there will be resource costs to doing this on a dataset. See my reprex below.

library(haven)
library(arrow)
library(tibble)
library(dplyr)

d <- tibble(
  a = labelled(x = 1:5),
  b = labelled(x = 11:15)
)

tf <- tempfile()
write_parquet(d, tf)

# still fails
read_parquet(tf, as_data_frame = FALSE) %>%
  filter(a > 3) %>%
  collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! NotImplemented: Function 'greater' has no kernel matching input types (<labelled<integer>[0]>, <labelled<integer>[0]>)
tf <- tempfile()
write_parquet(d, tf)

# works
read_parquet(tf, as_data_frame = FALSE) %>%
  mutate(a = as.integer(a)) %>%
  filter(a > 3) %>%
  collect()
#> # A tibble: 2 × 2
#>       a b        
#>   <int> <int+lbl>
#> 1     4 14       
#> 2     5 15
# fails
open_dataset(tf) %>%
  mutate(a = as.integer(a)) %>%
  filter(a > 3) %>%
  collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! NotImplemented: Function 'greater_equal' has no kernel matching input types (<labelled<integer>[0]>, <labelled<integer>[0]>)
# works but potentially higher resource usage
open_dataset(tf) %>%
  mutate(a = as.integer(a)) %>%
  compute() %>%
  filter(a > 3) %>%
  collect()
#> # A tibble: 2 × 2
#>       a b        
#>   <int> <int+lbl>
#> 1     4 14       
#> 2     5 15
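
For reference, a rough sketch of the "remove, then restore" idea (not what this PR does): capture the haven attributes, drop the labels for the Arrow round trip, then reattach them after collecting back into R.

library(haven)
library(arrow)
library(dplyr)
library(tibble)

d <- tibble(
  a = labelled(1:5, labels = c(low = 1L, high = 5L)),
  b = labelled(11:15)
)

# Remember the label-related attributes of each column.
saved_attrs <- lapply(d, attributes)

# Strip the labels so Arrow sees plain integer columns.
tf <- tempfile()
write_parquet(zap_labels(d), tf)

res <- read_parquet(tf, as_data_frame = FALSE) %>%
  filter(a > 3) %>%
  collect()

# Reattach the saved attributes to the surviving columns.
for (nm in intersect(names(res), names(saved_attrs))) {
  attributes(res[[nm]]) <- saved_attrs[[nm]]
}
res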

@thisisnic (Member, Author) commented:

I've stopped it segfaulting on printing, but I think the actual fix needs to go more layers deep.

@thisisnic (Member, Author) commented:

I'm also wondering whether, instead of supporting this, we should just stop the segfault, error appropriately, and recommend folks do something like:

open_dataset(whatever) %>%
  mutate(col = cast(col, int32())) %>%
  write_dataset(newlocation)

open_dataset(newlocation) %>%
  filter(col > 3) %>%
  collect()

Otherwise we're getting into the territory of supporting compute functions on extension types, which we don't currently do, and which, if implemented, should be done lower down the stack anyway.

More discussion on computing on extension types here: https://lists.apache.org/thread/2j61nrod7x0s5vjhc6q9tlj898drz7rn
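
For the guard itself, something like the following hypothetical helper (check_labelled_cols() is illustrative, not an existing arrow function) could run before conversion so we error cleanly instead of segfaulting:

# Hypothetical pre-conversion check; not part of arrow.
check_labelled_cols <- function(df) {
  bad <- names(df)[vapply(df, inherits, logical(1), what = "haven_labelled")]
  if (length(bad) > 0) {
    stop(
      "Column(s) ", paste(bad, collapse = ", "),
      " are haven_labelled, which Arrow compute does not support. ",
      "Cast them first, e.g. mutate(col = as.integer(col)), ",
      "or strip labels with haven::zap_labels().",
      call. = FALSE
    )
  }
  invisible(df)
}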
